cross-lingual word
Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages
Hangya, Viktor, Severini, Silvia, Ralev, Radoslav, Fraser, Alexander, Schütze, Hinrich
Very low-resource languages, having only a few million tokens' worth of data, are not well supported by multilingual NLP approaches due to poor-quality cross-lingual word representations. Recent work showed that good cross-lingual performance can be achieved if a source language is related to the low-resource target language. However, not all language pairs are related. In this paper, we propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach that incorporates intermediate related languages to bridge the gap between the distant source and target. We build MWEs one language at a time by starting from the resource-rich source and sequentially adding each language in the chain until we reach the target. We extend a semi-joint bilingual approach to multiple languages in order to eliminate the main weakness of previous works, i.e., independently trained monolingual embeddings, by anchoring the target language around the multilingual space. We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (<5M tokens) and 4 moderately low-resource (<50M) target languages, showing improved performance in both categories. Additionally, our analysis reveals the importance of good-quality embeddings for intermediate languages as well as the importance of leveraging anchor points from all languages in the multilingual space.
A Call for More Rigor in Unsupervised Cross-lingual Learning
Artetxe, Mikel, Ruder, Sebastian, Yogatama, Dani, Labaka, Gorka, Agirre, Eneko
In natural language processing, the main promise of multilingual learning is to bridge the digital language divide, to enable access to information and technology for the world's 6,900 languages (Ruder et al., 2019). For the purpose of this paper, we define "multilingual learning" as learning a common model for two or more languages from raw text, without any downstream task labels. Common use cases include translation as well as pretraining multilingual representations. We will use the term interchangeably with "cross-lingual learning". ... work implicitly includes monolingual and cross-lingual signals that constitute a departure from the pure setting. We review existing training signals as well as other signals that may be of interest for future study (§4). We then discuss methodological issues in UCL (e.g., validation, hyperparameter tuning) and propose best evaluation practices (§5). Finally, we provide a unified outlook of established research areas (cross-lingual word embeddings, deep multilingual models and unsupervised machine translation) in UCL (§6),
A Survey of Cross-lingual Word Embedding Models
Ruder, Sebastian, Vulić, Ivan, Søgaard, Anders
Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent, modulo optimization strategies, hyper-parameters, and such. We also discuss the different ways cross-lingual word embeddings are evaluated, as well as future challenges and research horizons.
Bilingual Lexicon Induction through Unsupervised Machine Translation
Artetxe, Mikel, Labaka, Gorka, Agirre, Eneko
A recent research line has obtained strong results on bilingual lexicon induction by aligning independently trained word embeddings in two languages and using the resulting cross-lingual embeddings to induce word translation pairs through nearest neighbor or related retrieval methods. In this paper, we propose an alternative approach to this problem that builds on the recent work on unsupervised machine translation. This way, instead of directly inducing a bilingual lexicon from cross-lingual embeddings, we use them to build a phrase-table, combine it with a language model, and use the resulting machine translation system to generate a synthetic parallel corpus, from which we extract the bilingual lexicon using statistical word alignment techniques. As such, our method can work with any word embedding and cross-lingual mapping technique, and it does not require any additional resource besides the monolingual corpus used to train the embeddings. When evaluated on the exact same cross-lingual embeddings, our proposed method obtains an average improvement of 6 accuracy points over nearest neighbor and 4 points over CSLS retrieval, establishing a new state-of-the-art in the standard MUSE dataset.
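The two retrieval baselines named in this abstract can be illustrated with a minimal sketch. It assumes the embeddings are already mapped into a shared space and uses the usual CSLS definition (cross-domain similarity local scaling, Conneau et al., 2018): CSLS(x, y) = 2 cos(x, y) - r_T(x) - r_S(y), where r_T(x) is the mean cosine similarity of x to its k nearest target neighbors and r_S(y) is the corresponding mean for y over source words. All function names and the toy similarity matrix are hypothetical and are not taken from the paper.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosines."""
    norm = math.sqrt(sum(a * a for a in vec))
    return [a / norm for a in vec]


def cosine_matrix(src, tgt):
    """sim[i][j] = cosine similarity of source word i and target word j."""
    src = [normalize(v) for v in src]
    tgt = [normalize(v) for v in tgt]
    return [[sum(a * b for a, b in zip(s, t)) for t in tgt] for s in src]


def mean_top_k(values, k):
    """Mean of the k largest values (all values if fewer than k exist)."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)


def nn_retrieve(scores):
    """Plain nearest neighbor: best-scoring target index per source word."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def csls_retrieve(sim, k=2):
    """CSLS retrieval: penalize targets (and sources) in dense regions."""
    r_src = [mean_top_k(row, k) for row in sim]        # r_T(x) per source
    r_tgt = [mean_top_k(col, k) for col in zip(*sim)]  # r_S(y) per target
    scores = [
        [2 * sim[i][j] - r_src[i] - r_tgt[j] for j in range(len(r_tgt))]
        for i in range(len(r_src))
    ]
    return nn_retrieve(scores)


# Toy similarity matrix (assumed values): target 0 is a "hub" that is the
# nearest neighbor of both source words.
toy_sim = [[0.9, 0.8], [0.85, 0.3]]
print(nn_retrieve(toy_sim))         # [0, 0]
print(csls_retrieve(toy_sim, k=2))  # [1, 0]
```

In the toy case, nearest neighbor assigns both source words to the same hub target, while CSLS redistributes them. This is the hubness problem CSLS was designed to reduce. The sketch does not cover the paper's own phrase-table and word-alignment pipeline.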
Unsupervised Cross-lingual Word Embedding by Multilingual Neural Language Models
Wada, Takashi, Iwata, Tomoharu
We propose an unsupervised method to obtain cross-lingual embeddings without any parallel data or pre-trained word embeddings. The proposed model, which we call multilingual neural language models, takes sentences of multiple languages as an input. The proposed model contains bidirectional LSTMs that perform as forward and backward language models, and these networks are shared among all the languages. The other parameters, i.e. word embeddings and linear transformation between hidden states and outputs, are specific to each language. The shared LSTMs can capture the common sentence structure among all languages. Accordingly, word embeddings of each language are mapped into a common latent space, making it possible to measure the similarity of words across multiple languages. We evaluate the quality of the cross-lingual word embeddings on a word alignment task. Our experiments demonstrate that our model can obtain cross-lingual embeddings of much higher quality than existing unsupervised models when only a small amount of monolingual data (i.e.
On the Limitations of Unsupervised Bilingual Dictionary Induction
Søgaard, Anders, Ruder, Sebastian, Vulić, Ivan
Unsupervised machine translation---i.e., not assuming any cross-lingual supervision signal, whether a dictionary, translations, or comparable corpora---seems impossible, but nevertheless, Lample et al. (2018) recently proposed a fully unsupervised machine translation (MT) model. The model relies heavily on an adversarial, unsupervised alignment of word embedding spaces for bilingual dictionary induction (Conneau et al., 2018), which we examine here. Our results identify the limitations of current unsupervised MT: unsupervised bilingual dictionary induction performs much worse on morphologically rich languages that are not dependent marking, when monolingual corpora from different domains or different embedding algorithms are used. We show that a simple trick, exploiting a weak supervision signal from identical words, enables more robust induction, and establish a near-perfect correlation between unsupervised bilingual dictionary induction performance and a previously unexplored graph similarity metric.
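The weak supervision signal from identical words mentioned in this abstract can be sketched as follows. The sketch assumes the signal is collected by pairing every string that occurs in both vocabularies; the helper name and the toy vocabularies are hypothetical, and how the paper feeds these pairs into the alignment step is not shown here.

```python
def identical_word_seeds(src_vocab, tgt_vocab):
    """Return (source_index, target_index) pairs for identically spelled words.

    Such pairs (names, numerals, loanwords) can act as a seed dictionary
    without any manually built bilingual resource.
    """
    tgt_index = {word: j for j, word in enumerate(tgt_vocab)}
    return [
        (i, tgt_index[word])
        for i, word in enumerate(src_vocab)
        if word in tgt_index
    ]


# Toy vocabularies (assumed, for illustration only).
src_vocab = ["berlin", "haus", "taxi", "2019"]
tgt_vocab = ["taxi", "house", "berlin", "2019"]
print(identical_word_seeds(src_vocab, tgt_vocab))  # [(0, 2), (2, 0), (3, 3)]
```

The seed pairs are noisy, since identical spellings are not always translations, so any use of them as supervision should tolerate some wrong pairs.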